The aim of this project is to analyze a given Diamonds dataset. We look for insights into how each variable influences the final diamond price.
Keep in mind that the following libraries have to be installed in your environment:
#data libraries
import numpy as np
import pandas as pd
#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
from IPython.display import set_matplotlib_formats
warnings.filterwarnings('ignore')
%matplotlib inline
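Before running the notebook, it can help to confirm that every dependency is importable. The sketch below uses only the standard library. The package list mirrors the imports in this notebook, plus `scipy` (imported later for Cramér's V) and `statsmodels`, which Plotly Express needs when a scatter plot uses `trendline="ols"`.

```python
import importlib.util

# Packages this notebook imports directly, plus statsmodels, which
# plotly.express requires when a scatter plot uses trendline="ols".
required = ["numpy", "pandas", "matplotlib", "seaborn", "plotly", "scipy", "statsmodels"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```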
Load the Diamonds dataset with pandas
diamonds_raw = pd.read_csv('../data/raw/diamonds_train.csv')
diamonds_raw
print(f'*Our dataset has {len(diamonds_raw.index)} rows and {len(diamonds_raw.columns)} columns*')
print(f'Null values per column:\n{diamonds_raw.isnull().sum()}')
# 'number' selects every numeric dtype regardless of platform-specific integer width
numerical_cols = diamonds_raw.select_dtypes(include='number').columns.to_list()
object_cols = diamonds_raw.select_dtypes(include=['object']).columns.to_list()
f'Dataset has these numerical cols:{numerical_cols} and these object cols:{object_cols}'
diamonds_raw.info(memory_usage='deep')
diamonds_raw.describe(include='all')
The dataset lacks price per carat, an important measure that lets us compare diamonds with similar characteristics but different weights. For diamonds of the same carat weight, we can go deeper and estimate the price from the other Cs (cut, color and clarity). It also provides us with a range of values for "similar" diamonds.
Add price per carat column
diamonds_raw['price_per_carat'] = diamonds_raw['price']/diamonds_raw['carat']
diamonds_raw
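As a small illustration of why the per-carat figure is useful, the sketch below computes it for three made-up stones. The values are hypothetical and not taken from the dataset; they only show that two diamonds can be compared on a common scale once price is divided by weight.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
sample = pd.DataFrame({"carat": [0.5, 1.0, 2.0], "price": [1500, 5000, 16000]})

# Same operation as in the notebook: price divided by carat weight.
sample["price_per_carat"] = sample["price"] / sample["carat"]
print(sample)
```

In this made-up sample the heaviest stone is also the most expensive per carat, which is the kind of pattern the new column makes visible.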
It doesn't make sense for the minimum of x, y or z to be 0. They are size measurements, so they cannot be 0. We will delete those rows.
size_ceros = (diamonds_raw['x']==0) | (diamonds_raw['z']==0) | (diamonds_raw['y']==0)
diamonds_raw.loc[size_ceros,:]
print(f'There are {len(diamonds_raw.loc[size_ceros,:])} rows with a value of 0 in at least one size measurement')
diamonds_no_ceros = diamonds_raw.drop(diamonds_raw.loc[size_ceros,:].index, axis=0, inplace=False)
diamonds_no_ceros
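The same filter can be written more compactly by testing all three size columns at once. This is a minimal sketch on hypothetical rows; `has_zero` and `cleaned` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows; the second and third contain an impossible 0 dimension.
sample = pd.DataFrame({
    "x": [4.0, 0.0, 5.1, 6.2],
    "y": [4.1, 4.3, 5.0, 6.1],
    "z": [2.5, 2.6, 0.0, 3.8],
})

# True for every row where at least one of x, y, z equals 0.
has_zero = (sample[["x", "y", "z"]] == 0).any(axis=1)

# Keep only the rows without a zero dimension.
cleaned = sample.loc[~has_zero]
print(f"Dropped {has_zero.sum()} of {len(sample)} rows")
```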
diamonds_duplicated_filter = diamonds_no_ceros[diamonds_no_ceros.duplicated(keep='first')]
diamonds_duplicated_filter.sort_values(by="cut", ascending=True)
diamonds = diamonds_no_ceros.drop(diamonds_duplicated_filter.index, axis=0, inplace=False)
diamonds
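The duplicate removal above locates repeated rows first and then drops them by index. On a small hypothetical frame, the sketch below shows that `drop_duplicates` gives the same result in one step; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the last one repeats the first exactly.
sample = pd.DataFrame({
    "carat": [0.3, 0.7, 0.3],
    "cut": ["Ideal", "Good", "Ideal"],
    "price": [500, 2100, 500],
})

# Approach used in this notebook: locate repeats, then drop them by index.
repeats = sample[sample.duplicated(keep="first")]
by_index = sample.drop(repeats.index, axis=0)

# Equivalent one-step form.
one_step = sample.drop_duplicates(keep="first")

print(by_index.equals(one_step))
```

Keeping the intermediate `repeats` frame, as the notebook does, has the advantage that the duplicated rows can be inspected before they are removed.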
diamonds.describe(include='all')
diamonds['cut'].unique()
diamonds['cut'].describe()
order_dict = {'Ideal': 0, 'Premium': 1, 'Very Good': 2, 'Good': 3,'Fair':4}
d_cut_ordered = diamonds.iloc[diamonds['cut'].map(order_dict).argsort()]
cut_mean = d_cut_ordered.groupby('cut')['cut'].count()
cut_dict = cut_mean.to_dict()
cut_dict
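An alternative to sorting through a rank dictionary is to declare the grades as an ordered categorical, so that sorting and grouping follow the quality order automatically. The sketch below assumes the same best-to-worst order as `order_dict`; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of cut grades in arbitrary order.
sample = pd.DataFrame({"cut": ["Good", "Ideal", "Fair", "Premium", "Very Good", "Ideal"]})

# Same best-to-worst order as order_dict in the notebook.
cut_order = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
sample["cut"] = pd.Categorical(sample["cut"], categories=cut_order, ordered=True)

# sort_values and groupby now follow the declared order, not alphabetical order.
ordered = sample.sort_values("cut")
counts = sample.groupby("cut", observed=False).size()
print(counts)
```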
fig, ax = plt.subplots(2,figsize=(17,8))
a = sns.countplot(x='cut', data = d_cut_ordered,ax=ax[0],palette='PuBu')
sns.boxplot(x="cut", y="price", data=d_cut_ordered, ax=ax[1],palette='PuBu')
plt.show()
fig = px.histogram(d_cut_ordered, x="price", facet_col="cut",color='cut',title='Price by cut',)
fig.show()
cut_proportion = diamonds.groupby('cut').size().reset_index()
# Data to plot
labels = cut_proportion['cut']
sizes = cut_proportion[0]
# Plot
plt.pie(sizes,labels=labels,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Color has a great effect on the value of a diamond. Diamond experts have created a color grading scale that runs from the letter D through to the letter Z: the further down the scale you go, the lower the value of the diamond.
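To make the scale concrete, the sketch below turns a grade letter into a numeric rank, assuming grades are single letters between D and Z. `color_rank` is a hypothetical helper introduced for illustration only.

```python
def color_rank(grade: str) -> int:
    """Return 0 for D (most valued), 1 for E, and so on up to 22 for Z."""
    grade = grade.upper()
    if len(grade) != 1 or not "D" <= grade <= "Z":
        raise ValueError(f"Not a valid color grade: {grade!r}")
    return ord(grade) - ord("D")

# Sort a few grades from most to least valued.
grades = ["G", "D", "J", "E"]
print(sorted(grades, key=color_rank))
```

Because the grades are consecutive letters, plain alphabetical sorting (as in `sort_values(by="color")`) already yields the best-to-worst order.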
diamonds['color'].unique()
diamonds['color'].describe()
fig, ax = plt.subplots(2,figsize=(17,9))
sns.countplot(x='color', data = diamonds.sort_values(by="color"),ax=ax[0],palette='PuBu')
sns.boxplot(x="color", y="price", data=diamonds.sort_values(by="color"), ax=ax[1],palette='PuBu')
plt.show()
fig = px.histogram(diamonds.sort_values(by="color"), x="price", facet_col="color",color='color',title='Price Distribution by Color')
fig.show()
color_proportion = diamonds.groupby('color').size().reset_index()
# Data to plot
labels = color_proportion['color']
sizes = color_proportion[0]
# Plot
plt.pie(sizes,labels=labels,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
A magnifying glass with 10 times magnification (10x lens) is used to see the inclusions (natural blemishes) present in the diamond. An expert certified in gemmology looks at the diamond under the 10x magnifying glass in order to count and locate the inclusions present in the diamond. The more inclusions a diamond has, the lower its value. The most sought-after diamonds are those whose inclusions cannot be seen at ten times magnification: these are Flawless (FL) and Internally Flawless (IF) diamonds. Conversely, P1, P2 and P3 diamonds (labeled I1, I2 and I3 on the GIA scale) are the least valued, as their inclusions are visible to the human eye.
diamonds['clarity'].unique()
diamonds['clarity'].describe()
order_dict = {'IF': 0, 'VVS1': 1, 'VVS2': 2, 'VS1': 3,'VS2':4,'SI1':5,'SI2':6,'I1':7}
d_clarity_orderded = diamonds.iloc[diamonds['clarity'].map(order_dict).argsort()]
fig, ax = plt.subplots(2,figsize=(17,9))
sns.countplot(x='clarity', data = d_clarity_orderded,ax=ax[0],palette='PuBu')
sns.boxplot(x="clarity", y="price", data=d_clarity_orderded, ax=ax[1],palette='PuBu')
plt.show()
fig = px.histogram(d_clarity_orderded, x="price", facet_col="clarity",color='clarity',title='Price by Clarity')
fig.show()
clarity_proportion = diamonds.groupby('clarity').size().reset_index()
# Data to plot
labels = clarity_proportion['clarity']
sizes = clarity_proportion[0]
# colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
# Plot
plt.pie(sizes,labels=labels,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
diamonds['carat'].describe()
# mode() returns a Series because several values can tie for most frequent
mode = diamonds['carat'].mode()
print(f'The mode of carat is {mode.tolist()}')
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,8))
sns.boxplot(x=diamonds['carat'], ax=ax_box)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=diamonds['carat'], kde=True, ax=ax_hist)
fig = px.scatter(diamonds, x="carat", y="price", color="carat", marginal_y="violin",marginal_x="box", trendline="ols", template="simple_white")
fig.show()
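The `trendline="ols"` option draws an ordinary least squares regression line of y on x. As a rough sketch of what that fit does, the example below recovers the slope and intercept of a straight line with NumPy. The carat and price values are hypothetical and lie exactly on a line, so they say nothing about the real dataset.

```python
import numpy as np

# Hypothetical carat/price pairs lying exactly on price = 8000 * carat - 2000.
carat = np.array([0.5, 1.0, 1.5, 2.0])
price = 8000 * carat - 2000

# A degree-1 polynomial fit is an ordinary least squares line.
slope, intercept = np.polyfit(carat, price, deg=1)
print(f"fitted line: price = {slope:.0f} * carat {intercept:+.0f}")
```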
diamonds['price'].describe()
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.20, .80)}, figsize=(12,8))
sns.boxplot(x=diamonds['price'], ax=ax_box)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=diamonds['price'], kde=True, ax=ax_hist)
diamonds['price_per_carat'].describe()
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.20, .80)}, figsize=(12,8))
sns.boxplot(x=diamonds['price_per_carat'], ax=ax_box)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=diamonds['price_per_carat'], kde=True, ax=ax_hist)
fig = px.scatter(diamonds, x="carat", y="price_per_carat", color="carat", marginal_y="violin",marginal_x="box", trendline="ols", template="simple_white")
fig.show()
When analyzing depth, keep in mind that depth is z / mean(x, y). The depth of a diamond is also called its “height”: it is the distance from the table to the culet (the pointed tip) of the diamond.
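The formula can be checked directly from the size columns. The sketch below assumes that depth is stored as a percentage, which is the usual convention for this kind of dataset; the dimensions are hypothetical and `depth_check` is an illustrative column name.

```python
import pandas as pd

# Hypothetical dimensions in millimetres.
sample = pd.DataFrame({"x": [4.0, 6.0], "y": [4.0, 6.2], "z": [2.4, 3.782]})

# Total depth percentage: height divided by the mean of length and width.
sample["depth_check"] = 100 * sample["z"] / sample[["x", "y"]].mean(axis=1)
print(sample.round(1))
```

Comparing such a recomputed column with the stored `depth` values is one way to spot rows whose measurements are inconsistent.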
diamonds['depth'].describe()
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,8))
sns.boxplot(x=diamonds['depth'], ax=ax_box)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=diamonds['depth'], kde=True, ax=ax_hist)
We observe that depth tends toward a normal distribution, and that a well-proportioned diamond can be more expensive than a bigger one. Among well-proportioned diamonds, carat also influences the price.
fig = px.scatter(diamonds, x="depth", y="price", color="carat", marginal_y="violin",marginal_x="box", trendline="ols", template="simple_white")
fig.show()
Table percentage is calculated by dividing the width of the table by the overall width of the diamond.
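As a worked example of this ratio, the sketch below computes the table percentage for one made-up stone. `table_percentage` is a hypothetical helper and the measurements are illustrative only.

```python
def table_percentage(table_width_mm: float, diameter_mm: float) -> float:
    """Table width as a percentage of the diamond's overall width."""
    if diameter_mm <= 0:
        raise ValueError("diameter_mm must be positive")
    return 100 * table_width_mm / diameter_mm

# Hypothetical stone: 3.7 mm table on a 6.5 mm wide diamond.
print(round(table_percentage(3.7, 6.5), 1))
```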
diamonds['table'].describe()
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,8))
sns.boxplot(x=diamonds['table'], ax=ax_box)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=diamonds['table'], kde=True, ax=ax_hist)
With the plot below we cannot conclude that a larger table means a higher price.
fig = px.scatter(diamonds, x="table", y="price" , color="table",marginal_y="violin",marginal_x="box", trendline="ols", template="simple_white")
fig.show()
diamonds[['x','y','z']].describe()
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,8))
sns.boxplot(x=diamonds['x'], ax=ax_box).set_title("Size - X")
sns.histplot(x=diamonds['x'], kde=True, ax=ax_hist)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,8))
sns.boxplot(x=diamonds['y'], ax=ax_box).set_title("Size - Y")
sns.histplot(x=diamonds['y'], kde=True, ax=ax_hist)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,8))
sns.boxplot(x=diamonds['z'], ax=ax_box).set_title("Size - Z")
sns.histplot(x=diamonds['z'], kde=True, ax=ax_hist)
There seems to be a relation between size and price. These plots seem to follow a logarithmic trend, where the larger x, y or z is, the higher the price.
fig, ax = plt.subplots(nrows = 3, ncols = 1, figsize = (20, 20))
sns.scatterplot(x = 'x', y = 'price', hue = 'color', data = diamonds, ax = ax[0]).set_title("Price / X")
sns.scatterplot(x = 'y', y = 'price', hue = 'color', data = diamonds, ax = ax[1]).set_title("Price / Y")
sns.scatterplot(x = 'z', y = 'price', hue = 'color', data = diamonds, ax = ax[2]).set_title("Price / Z")
From the plot of table against depth, we can see that the better the proportion between table and depth, the higher the price.
fig = px.scatter(diamonds, x="table", y="depth", color="cut" , size='price', marginal_y="violin",marginal_x="box", trendline="ols", template="simple_white")
fig.show()
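# Restrict to numeric columns: recent pandas versions raise an error on text columns in corr()
correlation_df = diamonds.select_dtypes(include='number').corr()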
correlation_df
corrtrue = correlation_df.apply(lambda x: x > 0.8)
lista = corrtrue.apply(lambda x: ','.join(x.index[x]), axis=1)
lista
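The list above also pairs every variable with itself and reports each pair twice. One possible refinement, sketched here on hypothetical measurements, keeps only the upper triangle of the correlation matrix so that each strongly correlated pair appears once. The names `upper`, `pairs` and `strong_pairs` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: x and y move together, depth does not.
sample = pd.DataFrame({
    "x": [4.0, 5.0, 6.0, 7.0, 8.0],
    "y": [4.1, 5.0, 6.2, 7.1, 8.0],
    "depth": [62.0, 60.5, 63.0, 61.0, 62.5],
})
corr = sample.corr()

# Keep only the upper triangle so each pair appears once and the
# diagonal (every variable against itself) is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Turn the matrix into (row, column) -> value pairs and filter by threshold.
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.8]
print(strong_pairs)
```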
f, ax = plt.subplots(figsize=(19, 6))
sns.heatmap(diamonds.select_dtypes(include='number').corr(), annot=True,fmt='.2f', linewidths=6, center=0,ax=ax, cmap="YlGnBu").set_title("Correlation between quantitative variables")
Before measuring the strength of the relation, these contingency tables show how two categorical variables relate to each other.
# contingency table as row-relative percentages by cut
pd.crosstab(index=diamonds['cut'], columns=diamonds['color']
).apply(lambda r: r/r.sum() *100,
axis=1)
pd.crosstab(index=diamonds['cut'], columns=diamonds['clarity']).apply(lambda r: r/r.sum() *100,
axis=1)
pd.crosstab(index=diamonds['color'], columns=diamonds['clarity']).apply(lambda r: r/r.sum() *100,
axis=1)
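pandas can also compute these row percentages directly through the `normalize` argument of `crosstab`. The sketch below, on a hypothetical sample, checks that it matches the `apply` approach used above.

```python
import pandas as pd

# Hypothetical categorical sample.
sample = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Good", "Good"],
    "color": ["D", "E", "E", "D", "D"],
})

# Approach used in the notebook: divide each row by its own total.
counts = pd.crosstab(index=sample["cut"], columns=sample["color"])
manual = counts.apply(lambda r: r / r.sum() * 100, axis=1)

# Built-in alternative: normalize="index" makes every row sum to 1.
direct = pd.crosstab(index=sample["cut"], columns=sample["color"], normalize="index") * 100

# Largest absolute difference between the two results.
max_diff = (manual - direct).abs().to_numpy().max()
print(direct.round(1))
```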
We use Cramér's V to measure the strength of the association between two qualitative (categorical) variables.
import scipy.stats as ss
import numpy as np

def cramers_v(x, y):
    """Bias-corrected Cramér's V between two categorical Series."""
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias correction for phi2 and for the table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
clarity_vs_cut = cramers_v(diamonds["clarity"],diamonds["cut"])
color_vs_cut = cramers_v(diamonds["color"],diamonds["cut"])
clarity_vs_color = cramers_v(diamonds["clarity"],diamonds["color"])
print(f'The relation between clarity and cut is:{clarity_vs_cut}.')
print(f'The relation between color and cut is:{color_vs_cut}.')
print(f'The relation between clarity and color is:{clarity_vs_color}.')
All the relations are below 0.3, so the association between these variables is weak.
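To give the 0.3 threshold some context, the sketch below evaluates the same bias-corrected formula on two hypothetical 3x3 tables: one with no association and one with perfect association. It computes the chi-square statistic by hand with NumPy instead of calling `scipy`; for tables larger than 2x2 this matches `chi2_contingency`, which only applies a continuity correction when there is one degree of freedom. `cramers_v_from_table` is an illustrative name.

```python
import numpy as np

def cramers_v_from_table(table):
    """Bias-corrected Cramér's V for a contingency table of counts."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    phi2 = chi2 / n
    r, k = observed.shape
    phi2corr = max(0.0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))

# Hypothetical 3x3 tables.
independent = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]   # no association
perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]    # each row maps to one column

v_independent = cramers_v_from_table(independent)
v_perfect = cramers_v_from_table(perfect)
print(v_independent, v_perfect)
```

The statistic ranges from 0 (no association) to 1 (one variable fully determines the other), so values under 0.3 sit near the weak end of the scale.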
With the previous analysis, we can create our own Rapaport table based on our dataset.
diamonds.pivot_table(index=['color'], columns=['cut','clarity'], values=['price_per_carat'])
custom_rapaport = diamonds.groupby(['cut','clarity','color','carat'])['price_per_carat'].mean().reset_index()
custom_rapaport
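To show what such a grid looks like on a small scale, the sketch below builds a color by clarity table of average price per carat from a few hypothetical rows. The values are made up and the layout is only loosely inspired by a Rapaport-style price grid.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
sample = pd.DataFrame({
    "color": ["D", "D", "E", "E", "E"],
    "clarity": ["IF", "VS1", "IF", "IF", "VS1"],
    "price_per_carat": [9000.0, 6000.0, 8000.0, 7000.0, 5000.0],
})

# pivot_table averages the values in each color/clarity cell by default.
grid = sample.pivot_table(index="color", columns="clarity", values="price_per_carat")
print(grid)
```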
After a first analysis, we decided to delete the rows where a size measurement of the diamond was 0. After that, we deleted the duplicated rows.
Qualitative variables
For each variable, we create four plots. The first one, a bar chart, gives us an idea of the distribution. The box plot crosses each variable with the "main" variable of this analysis, the price, and tells us more about quartiles, medians and outliers. The third is a price histogram split by category, and the fourth is a pie chart showing the percentage of each type in the category. We can extract these conclusions for the qualitative variables analyzed:
Quantitative variables
For each variable, we create two plots. The first one combines a box plot, which tells us more about quartiles, median and outliers, with a distribution plot. In the second plot, we cross each variable with the "main" variable of this analysis, the price. We can extract these conclusions for the quantitative variables analyzed:
# Keep the index here: it holds the variable name of each correlation row
correlation_df.to_csv('../data/results/diamond_correlation.csv', index=True)
diamonds.to_csv('../data/results/diamond_clean.csv', index=False)
custom_rapaport.to_csv('../data/results/custom_rapaport.csv', index=False)